In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to traditional approaches that build speaker embeddings from hand-crafted spectral features, we train a recurrent convolutional neural network for this purpose, applied directly to magnitude spectrograms. To compare our approach with the state of the art, we collect and publicly release an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that the proposed method significantly outperforms competing approaches, reducing diarization error rate by a large margin of over 30% with respect to the baseline.
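The abstract names the embedding network only at a high level. As a hedged illustration of what a recurrent convolutional embedder over magnitude spectrograms might look like, the following PyTorch sketch is given; the `SpeakerEmbedder` class, layer sizes, pooling strategy, and embedding dimension are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' exact architecture): a recurrent
# convolutional network mapping a magnitude spectrogram to a fixed-size
# speaker embedding. All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    def __init__(self, n_freq=128, embed_dim=256):
        super().__init__()
        # Convolutional front-end over the (frequency, time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # downsample frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        # Recurrent layer over the time axis of the convolutional features.
        self.rnn = nn.GRU(input_size=64 * (n_freq // 4),
                          hidden_size=embed_dim, batch_first=True)

    def forward(self, spec):
        # spec: (batch, freq_bins, time_frames) magnitude spectrogram
        x = self.conv(spec.unsqueeze(1))                 # (batch, 64, freq/4, time)
        x = x.permute(0, 3, 1, 2).flatten(2)             # (batch, time, 64 * freq/4)
        _, h = self.rnn(x)                               # final hidden state
        return nn.functional.normalize(h[-1], dim=-1)    # unit-norm embedding

# Example: embed a batch of utterances (128 frequency bins, 300 frames).
model = SpeakerEmbedder()
spectrograms = torch.randn(4, 128, 300).abs()
embeddings = model(spectrograms)                         # shape: (4, 256)
```

Embeddings produced this way could then be clustered to assign speaker labels to segments, which is the standard final step in a diarization pipeline.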